
fix: GC orphan summaries + WASM savepoint nesting [WIP 4/5]#6

Closed
andreinknv wants to merge 2 commits into pr3-split-provider-and-summary-batching from pr4-gc-wasm-fixes

Conversation

@andreinknv
Owner

Stress-test fixes — PR 4 of 5 (stacked). Base: pr3-split-provider-and-summary-batching.

Commits

  • 1e5fa19 fix: GC orphan summary rows after each summarize pass
  • 9adac63 fix: prune orphan directory_summaries; WASM adapter savepoint nesting

Summary

Symbol-summary GC (1e5fa19)

Surfaced after running --force with two different chat models on the same repo: symbol_summaries held 723 rows — 675 fresh + 48 stale from the prior model whose node_ids no longer exist. Cache-preservation in PR 2 left them behind; the by-content-hash fallback didn't reclaim them onto current node_ids.

pruneOrphanSummaries() runs DELETE FROM symbol_summaries WHERE node_id NOT IN (SELECT id FROM nodes) inside a transaction. Embedding rows cascade-delete via existing FK. Wired AFTER summarizeAll (so the content-hash reclaim has had its chance) and BEFORE embedding (so embeddings only run on live rows).

Directory-summary GC (9adac63)

Same issue for directory_summaries. Live-dir set computed via path.posix.dirname(filePath) — exactly mirroring what dir-summarizer.ts writes. Crucially, it does NOT walk the ancestor chain (the first draft did, which let stale intermediate-dir summaries survive — e.g. a src summary from when files lived directly in src/ but have since all moved into src/core/). Regression-tested.

WASM savepoint nesting (9adac63)

WasmDatabaseAdapter was using bare BEGIN/COMMIT for nested transactions, which sqlite-wasm rejects. Switched to savepoint-based nesting. Also exported the adapter from the package index so callers can construct it directly (the only way to exercise WASM on Mac, where createDatabase always picks better-sqlite3).

Test plan

  • Suite green at 1113 / 13 skip / 0 fail.
  • New tests in __tests__/migrations-022.test.ts cover live-preserved + orphan-pruned + ancestor-only-pruned scenarios.

WIP draft.

andreinknv and others added 2 commits April 30, 2026 19:52
Surfaced by end-to-end measurement: after running `--force` with two
different chat models (Phase A-MLX with qwen3-coder, then Phase A
bridge+batched with claude-haiku-4-5), the codegraph self-repo's
symbol_summaries table held 723 rows — 675 fresh haiku entries plus
48 stale qwen3-coder entries whose node_ids no longer exist in the
current `nodes` table. The orphans accumulated because the cache-
preservation work in bd26945 (clearStructural with FK toggle) left
summary rows behind on every --force, and rows that the by-content-
hash fallback in summarizeAll didn't reclaim onto a current node_id
were never cleaned up.

The impact is bookkeeping noise (inflated row counts, misleading
Summary coverage metrics) and slow content_hash index growth — no
logical bug, but worth fixing while the design is fresh.

## Implementation

New `QueryBuilder.pruneOrphanSummaries()` runs:

  DELETE FROM symbol_summaries WHERE node_id NOT IN (SELECT id FROM nodes)

inside a transaction, with before/after row counts captured atomically
in the same transaction. Returns `{ summariesDeleted, embeddingsDeleted }`.
Embedding rows cascade-delete via the existing FK CASCADE on
`symbol_embeddings.node_id REFERENCES symbol_summaries(node_id)`.

Wired into `startBackgroundSummarization` to run after Phase 1
(summarize) completes, before Phase 2 (embed). Logged via logDebug
only when something was actually deleted.
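The count-then-delete-then-recount shape can be sketched as follows. This is a minimal illustration against an assumed `DbLike` adapter shape; the SQL comes from the text above, the rest is illustrative, not the shipped QueryBuilder code:

```typescript
// Assumed minimal adapter surface (not the real better-sqlite3 typing).
interface DbLike {
  prepare(sql: string): { get(): unknown; run(): unknown };
  transaction<T>(fn: () => T): () => T;
}

interface PruneResult {
  summariesDeleted: number;
  embeddingsDeleted: number;
}

function pruneOrphanSummaries(db: DbLike): PruneResult {
  // Counts and the DELETE share one transaction so the before/after
  // snapshot is atomic (no inter-process race on the deltas).
  return db.transaction((): PruneResult => {
    const count = (table: string) =>
      (db.prepare(`SELECT COUNT(*) AS n FROM ${table}`).get() as { n: number }).n;
    const summariesBefore = count("symbol_summaries");
    const embeddingsBefore = count("symbol_embeddings");
    // Embedding rows disappear via the existing FK CASCADE on
    // symbol_embeddings.node_id REFERENCES symbol_summaries(node_id).
    db.prepare(
      "DELETE FROM symbol_summaries WHERE node_id NOT IN (SELECT id FROM nodes)",
    ).run();
    return {
      summariesDeleted: summariesBefore - count("symbol_summaries"),
      embeddingsDeleted: embeddingsBefore - count("symbol_embeddings"),
    };
  })();
}
```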

## Timing rationale

MUST run AFTER summarizeAll completes — by then any genuinely
relocated summaries (function moved files, line shifted, extraction-
version drift) have been re-attached to their new node_ids via the
content-hash fallback inside summarizeAll. Calling DURING summarizeAll
would race the reclaim and lose cache hits.

NOT called in no-chat configurations: orphans there might still be
legitimately reclaimable by a FUTURE chat-enabled session via the
content-hash fallback. Pruning prematurely would force the next chat
session to regenerate from scratch. Acceptable cost (storage growth in
configs that never gain a chat backend) for correctness in the more
common case. Documented in the method JSDoc.

## Tests

`__tests__/migrations-022.test.ts` — 2 new tests:

- `pruneOrphanSummaries deletes only summaries whose node_id is gone`
  — sets up 3 nodes + 3 summaries + 2 embeddings, simulates
  clearStructural by toggling FKs OFF and DELETEing 2 nodes (the exact
  post-clearStructural-then-reindex state), then prunes. Asserts:
  2 summaries deleted, 1 embedding cascade-deleted, 1 live summary
  + 1 live embedding survive.

- `pruneOrphanSummaries is a no-op when nothing is orphaned` — asserts
  zero deletions and zero state change when all summaries have matching
  nodes (idempotent on healthy DB).

Suite: 1095 / 13 skip / 0 fail (was 1093).

## Reviewer trail

Two passes. Pass 1: REQUEST_CHANGES + 1 substantive finding (the four
COUNT(*) reads were taken outside the DELETE transaction, leaving an
inter-process race window where the embeddingsDeleted count could be
wrong) + 2 info findings (directory_summaries scope, no-chat
limitation). Substantive fix: wrapped all four counts AND the DELETE
inside a single this.db.transaction(() => { ... return obj; })() so
the snapshot is atomic. Info finding 3 addressed by adding a JSDoc
paragraph documenting the no-chat-session reasoning.

Pass 2: APPROVE + 1 info finding (the WASM adapter doesn't support
nested transactions — not actionable here since pruneOrphanSummaries
is never called from another transaction).

## Out of scope (deferred to follow-up)

- `directory_summaries` rows for deleted directories accumulate
  similarly. Different keying (dir_path, not node_id) means a different
  prune query. Worth a future PR.
- WASM adapter savepoint support — the WasmDatabaseAdapter.transaction()
  uses bare BEGIN/COMMIT, no savepoints. Future callers that nest
  pruneOrphanSummaries inside another transaction would hit a SQLite
  "cannot start a transaction within a transaction" error on WASM only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the two follow-ups noted in 1e5fa19's commit message:

## directory_summaries orphan pruning

Same issue `pruneOrphanSummaries` fixed for symbol_summaries, but for
directory_summaries — when a dir is deleted/renamed/excluded its
summary lingers indefinitely (no FK anchor; dir_path is the PK).

`QueryBuilder.pruneOrphanDirectorySummaries()`:
- Reads all `files.path` rows.
- For each file, computes the IMMEDIATE PARENT directory via
  `path.posix.dirname(filePath.replace(/\\/g, '/'))` — exactly
  mirroring what `dir-summarizer.ts:groupByDir` writes. Crucially
  does NOT walk the ancestor chain (the first draft did, which would
  let stale intermediate-dir summaries survive — e.g. a `src` summary
  from when files lived directly in `src/` but have since all moved
  into `src/core/`).
- Inserts the live set into a `TEMP TABLE _live_dirs` and runs
  `DELETE FROM directory_summaries WHERE dir_path NOT IN (...)`.
- All four COUNT(*) reads + the DELETE share one transaction so the
  before/after snapshot is atomic (same race fix as
  `pruneOrphanSummaries`).
- Returns `{ directorySummariesDeleted }`.
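The live-dir computation reduces to a few lines. A hedged sketch (the helper name `computeLiveDirs` is an assumption; the real code feeds this set into the TEMP-table DELETE):

```typescript
import * as path from "node:path";

// Only the IMMEDIATE parent of each indexed file counts as live.
// Deliberately no ancestor walk, so a stale `src` summary is pruned
// once every file has moved into `src/core/`.
function computeLiveDirs(filePaths: string[]): Set<string> {
  const live = new Set<string>();
  for (const fp of filePaths) {
    // Normalise Windows separators before taking the posix dirname.
    live.add(path.posix.dirname(fp.replace(/\\/g, "/")));
  }
  return live;
}
```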

Wired into `startBackgroundSummarization` to run after Phase 3 (dir
summarizer). Same timing rationale as symbol prune: must run AFTER
the writer has had its chance to refresh stale entries, or we'd race
its writes for newly-summarised dirs.

### Tests in `__tests__/migrations-022.test.ts` (3 new)

- Live dirs preserved, orphan dirs pruned (mixed scenario with
  4 indexed files spanning src/a, src/b, and src directly).
- Empty files table prunes everything (degenerate case).
- **Stale ancestor-only dir IS pruned** — regression test for the
  reviewer-caught under-pruning bug. Indexes only `src/core/foo.ts`,
  plants summaries for both `src/core` (live) and `src` (stale
  ancestor-only). Asserts only `src/core` survives. Pre-fix this
  would have left `src` alive due to the ancestor walk.

## WASM adapter savepoint nesting

`node-sqlite3-wasm` doesn't transparently handle nested transactions
like better-sqlite3 does. Pre-fix, a caller wrapping a `transaction()`
inside another `transaction()` would get "cannot start a transaction
within a transaction" — only on the WASM backend, silently OK on
native. The reviewer flagged this in 1e5fa19's pass 1 as out-of-scope.

`WasmDatabaseAdapter.transaction()` in `src/db/sqlite-adapter.ts`:

- Added `private _txDepth = 0` field tracking nesting depth on the
  connection.
- Outer call (depth 0) → BEGIN/COMMIT/ROLLBACK.
- Inner call (depth > 0) → SAVEPOINT/RELEASE/ROLLBACK TO + RELEASE.
- Savepoint name `cg_sp_${depth}` is depth-tagged so nested levels
  produce stable, distinct names; safe to inline (depth is a
  non-negative integer, no SQL-injection risk).
- Rollback paths wrapped in try/catch — if the rollback itself fails,
  the original error is annotated with a "(NOTE: ROLLBACK also failed:
  ...)" suffix before re-throw. Operators reading the log get a signal
  when connection state may be uncertain.
- `finally { this._txDepth-- }` restores depth on both success and
  failure paths.
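A minimal sketch of the depth-tracked nesting, with the SQL executor stubbed as a plain callback (the `exec` signature and class shape are assumptions; the behavior follows the bullets above):

```typescript
class NestingTx {
  private _txDepth = 0;
  constructor(private exec: (sql: string) => void) {}

  transaction<T>(fn: () => T): T {
    const depth = this._txDepth;
    const sp = `cg_sp_${depth}`; // depth-tagged, integer-only name
    this.exec(depth === 0 ? "BEGIN" : `SAVEPOINT ${sp}`);
    this._txDepth++;
    try {
      const result = fn();
      this.exec(depth === 0 ? "COMMIT" : `RELEASE ${sp}`);
      return result;
    } catch (err) {
      try {
        if (depth === 0) {
          this.exec("ROLLBACK");
        } else {
          this.exec(`ROLLBACK TO ${sp}`);
          this.exec(`RELEASE ${sp}`);
        }
      } catch (rbErr) {
        // Annotate rather than swallow: operators see both failures.
        if (err instanceof Error) {
          err.message += ` (NOTE: ROLLBACK also failed: ${rbErr})`;
        }
      }
      throw err;
    } finally {
      this._txDepth--; // restored on both success and failure paths
    }
  }
}
```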

`WasmDatabaseAdapter` is now exported (was un-exported `class`) so
tests can drive it directly on machines where better-sqlite3 is
installed and `createDatabase` would otherwise pick native. Brief
JSDoc on the export explains the test-only intent.

### Tests in `__tests__/wasm-savepoint.test.ts` (NEW, 8 tests)

- Single-level commit
- Nested commit (outer + inner both succeed)
- Inner throw + outer recovers (writes survive even though inner's
  SAVEPOINT was rolled back) — this is THE case that errored pre-fix.
- Outer throw → full rollback (zero rows survive)
- Depth counter resets on success AND failure paths (third-run-after-
  failure case proves the `finally` restores depth on both branches).
- Three-deep nesting with L3 failure rolled back inside L2.
- transaction() return value passes through.
- transaction() args forwarded to wrapped function.

Suite: 1106 / 13 skip / 0 fail (was 1095, +11 new tests).

## Reviewer trail

Two passes. Pass 1: REQUEST_CHANGES + 1 substantive (under-pruning
ancestor-walk bug — would let stale `src` summary survive when files
all moved into `src/core`) + 1 info (silently swallowed outer-rollback
errors). Substantive: rewrote prune to use immediate-parent only +
added regression test that would have failed pre-fix. Info: annotate
caught rollback errors onto the original exception message. Pass 2:
REQUEST_CHANGES + 1 finding (stale JSDoc still described the removed
ancestor-walk behavior) + 1 info (frozen-Error edge case in error
mutation — theoretical, not actionable). JSDoc fixed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@andreinknv
Owner Author

Stack collapsed into wip-stress-base (now at 56fac32). Commits landed via the fast-forward of wip-stress-base from df96c6c to 56fac32; #3 shows as merged, #4-#7 are closed because their tips are already ancestors of the base. Branch preserved for reference.

@andreinknv andreinknv closed this May 1, 2026
@andreinknv andreinknv deleted the pr4-gc-wasm-fixes branch May 1, 2026 11:37
andreinknv added a commit that referenced this pull request May 1, 2026
… server-config flags

Tooling-gap backlog (codegraph/docs/codegraph-tooling-gaps.md) closed:
  #1 freshness severity bucket — `classifyFreshness` with fresh|recent|stale|very_stale
  #2 allowStale flag — opt-in bypass for the heavy-drift gate, registry-injected schema
  #3 module format in status — `module-format.ts` parses package.json + tsconfig (JSONC-safe)
  #4 codegraph_imports tool + import-classifier — file/directory/bare/unresolvable filters
  #5 dynamic imports — extractor catches `import('…')` + `require('…')`, incl. template_string
  #6 build-context refs — new `build_context_refs` table for `__dirname` / `import.meta.*`
  #7 files.is_test flag — column populated by glob; surfaced in status as `(N test)`
  colbymchenry#11 summarize-also-embeds (discovered while dogfooding) — `cg.summarizeAll()` chains
       `embedAllSummaries`; new `cg.embedAll()` for embed-only path; CLI `codegraph embed`

CLI/MCP alignment (5/32 → 33+/35):
  - 13 new CLI commands via `runViaMCP` shim: callers, callees, impact, node, similar,
    biomarkers, imports, help-tools, explore, hotspots, dead-code, config-refs, sql-refs,
    module-summary, role, coverage-query, pending-summaries, save-summaries, review-context
  - 7 new MCP tools: codegraph_imports, codegraph_embed, codegraph_summarize, codegraph_sync,
    codegraph_reindex, codegraph_coverage_ingest, codegraph_init, codegraph_uninit,
    codegraph_unlock, codegraph_affected

MCP server-level operator config (`codegraph serve --mcp`):
  - --no-write-tools / --allow-stale-default / --disable-tool (sandboxing)
  - --llm-endpoint / --llm-chat-model / --llm-ask-model / --llm-embedding-model /
    --llm-api-key (operator LLM config; per-project config wins on conflict)
  - New CODEGRAPH_LLM_* env vars wired through `mergeLlmEnv` in resolveLlmProviders

Architectural cleanups:
  - `bypassFreshnessGate` and `isWriteTool` declarative flags on ToolModule (replaces
    growing string-comparison chain in execute())
  - `withAllowStale` registry injection only on tools that DO see the gate
  - DRY of inline copy-paste in 3 hooks → `src/index-hooks/enclosing.ts`
  - `LlmClient.isEmbeddingReachable` for split-provider correctness
  - SyncResult `lockContention` flag → handleSync emits distinct retryable message
  - `clearStructural` deletes from build_context_refs (was orphan-leaking on --force)
  - cli:dev npm script + tsx CLI fixed (web-tree-sitter `import type` for type-only refs)

Migrations:
  023-files-is-test.ts — add `files.is_test`
  024-build-context-refs.ts — add `build_context_refs` table

Reviewer rounds: 11 total, all REQUEST_CHANGES addressed inline. Notable fixes:
  - JSONC URL strip via state machine (was eating `https://` tails)
  - classifyFreshness very_stale now requires isStale (in-sync-but-old → recent)
  - Dynamic imports also match template_string nodes
  - process.exit deferred until after finally cleanup in runViaMCP
  - --same-language / --different-language mutual exclusion guard
  - help-tools CLI bypasses isInitialized (works without a project)
  - handleUninit sweeps projectCache by getProjectRoot (no dangling alias leaks)
  - handleAffected errors instead of silently dropping unsupported glob filters
  - mergeLlmEnv preserves precedence: legacy flat config wins over env-synthesised block

Suite: 1268 passing, 1 expected red (colbymchenry#8 — undecided), 13 skipped, 1 todo, 0 regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 3, 2026
After the structural diff lands, the natural next question is "did
my edits introduce or clean up any biomarker findings?". Pre-PR the
agent had to read the persisted index AND remember the baseline
state — neither of which travels well across an edit-loop. This
extension answers it from a git ref + in-memory analysis only, no
fresh sync required.

How it works
------------
For each file already extracted by compareToRef, run the existing
per-file biomarker engine (parseSource + findNodeInTree +
computeMetrics + evaluateRules) on BOTH sides. Diff findings by
(qualifiedName, kind, biomarker):
- only-in-after  → added   (new regression)
- only-in-before → cleared (refactored away)
- both           → carried (status quo, surfaced as a count only
                            so the report doesn't double in size)

Cross-file biomarkers (unused_export, god_class, feature_envy) are
out of scope: they need the whole graph at the baseline ref, which
would require persisting and replaying graph state. Per-file rules
already cover 8 of 11 biomarker types, which is the dominant signal
for an edit-loop self-report.
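The three-bucket diff is essentially a keyed set difference. A sketch under assumed type shapes (the real `FindingDelta` exports live in src/compare/index.ts):

```typescript
interface Finding {
  qualifiedName: string;
  kind: string;
  biomarker: string;
}

function computeFindingsDelta(before: Finding[], after: Finding[]) {
  // Findings are keyed by (qualifiedName, kind, biomarker).
  const key = (f: Finding) => `${f.qualifiedName}\u0000${f.kind}\u0000${f.biomarker}`;
  const beforeKeys = new Set(before.map(key));
  const afterKeys = new Set(after.map(key));
  return {
    added: after.filter((f) => !beforeKeys.has(key(f))),    // new regression
    cleared: before.filter((f) => !afterKeys.has(key(f))),  // refactored away
    carried: after.filter((f) => beforeKeys.has(key(f))).length, // count only
  };
}
```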

Surfaces
--------
- src/compare/index.ts — new FindingDelta / FindingsDelta exports,
  evaluateFileFindings + computeFindingsDelta helpers, findingsDelta
  flag on CompareOptions, optional findingsDelta field on FileDelta.
- src/mcp/tools/compare.ts — new findingsDelta arg + render sections
  ("+ findings introduced", "− findings cleared", "carried over: N").
- src/bin/codegraph.ts — --findings-delta flag on compare-to-ref CLI.
- src/biomarkers/index.ts — exports ANALYSABLE_KINDS and
  ANALYSABLE_MIN_LOC (the in-memory analysis needs the same gate as
  the indexed pass).

Tests (4 new in __tests__/compare.test.ts)
- regression case: gnarly edit on alpha → added findings on alpha
- cleanup case: gnarly baseline + clean current → cleared findings
- carry case: gnarly baseline + slightly-tweaked-but-still-gnarly
  current → carried > 0, added/cleared = 0
- no-change case: empty findings delta when nothing differs from REF

Suite: 1370/13/0 (was 1366; +4 new). tsc clean.

Reviewer pass
-------------
APPROVE with two info-level fixes addressed before commit:
- carried bucket was populated but never rendered → added a
  collapsed "N pre-existing findings carried over" line.
- carried path was untested → added the tweaked-but-gnarly case.

Dogfooded on this very PR; the report correctly surfaced findings
my own changes introduced, including a long_parameter_list warning
on computeFindingsDelta that I bundled into a FileSnapshot interface
before commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 3, 2026
…olbymchenry#8)

Promotes the resolver numeric `metadata.confidence` to a categorical,
queryable column on `edges`. `codegraph_callers`, `_callees`,
`_impact` now annotate each row with `*(INFERRED)*` when the edge is
not concrete, and accept a new `minConfidence` arg to drop edges
below a chosen level — useful for refactor-impact analysis where
false-positive reach inflates the blast radius.

Migration 032 adds `edges.confidence TEXT` (nullable; the read-cast in
EDGE_SCHEMA collapses NULL → `EXTRACTED`). Nullable means legacy /
structural / extractor-direct edges need no producer change. Index
on the column for filter-down queries. Idempotent ADD COLUMN with
table-exists guard following the convention from migrations 020 /
021 / 023 / 030.

Producer: `ReferenceResolver.createEdges` calls a new free
`classifyConfidence(resolvedBy, score)` helper:
  - import / qualified-name / file-path / exact-match → EXTRACTED
  - framework with score >= 0.85 → EXTRACTED, else INFERRED
  - instance-method / fuzzy → INFERRED
AMBIGUOUS is not stamped from the resolver path in v1 — needs
tie-tracking from the candidate picker. Tracked as B3.
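The rules above transcribe almost directly into code. A sketch (the `resolvedBy` strings are assumptions about the resolver's internal naming; the 0.85 threshold comes from the text):

```typescript
type Confidence = "EXTRACTED" | "INFERRED";

function classifyConfidence(resolvedBy: string, score: number): Confidence {
  switch (resolvedBy) {
    case "import":
    case "qualified-name":
    case "file-path":
    case "exact-match":
      return "EXTRACTED";
    case "framework":
      // Framework edges are concrete only above the calibration cutoff.
      return score >= 0.85 ? "EXTRACTED" : "INFERRED";
    default:
      // instance-method, fuzzy, anything unrecognised
      return "INFERRED";
  }
}
```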

Consumers (`result-formatters.ts`):
  - `formatConfidence(edge)` — markdown italic suffix (empty for
    EXTRACTED; `*(INFERRED)*` / `*(AMBIGUOUS)*` otherwise).
  - `parseMinConfidence(raw)` — returns level / null / errorResult.
  - `filterByConfidence(rows, min)` — drop below threshold.
  - `CONFIDENCE_RANK` — numeric ordering for inline filter loops.
Wired into all four code paths in `callers.ts` (collectCallers,
collectCallersForSource, collectTypeUsers, formatGroupedCallers),
both paths in `callees.ts`, and `impact.ts` with a
`filterImpactByConfidence` BFS that prunes nodes unreachable along
the surviving edge set.

Type cleanup: `Edge.confidence` added with JSDoc; the long-dead
`Edge.provenance` field also got a JSDoc note pointing at the
SCIP-integration-reservation explanation in memory.

Reviewer-memo #4 catch (schema-version test forgetfulness): noticed
`__tests__/pr19-improvements.test.ts` was still asserting `toBe(29)`
even though migrations 030 and 031 had shipped. Catch-up to 32 in
this commit covers both the missed bumps and the new migration.

Reviewer: APPROVE, three info-only findings (type-user display gap,
framework-rule calibration table doc, catch of the recurring pr19
pattern).

Suite: 1555 / 34 / 0 (+18 new edge-confidence tests).
Eval: 11/11 passed | recall=1.00 | mrr=0.86 (within regression
budget vs 091e935 baseline).

Phase 3 done. Backlog now down to: #6 TOON (deferred), colbymchenry#11
trace_to_culprits, colbymchenry#17/colbymchenry#18 Phase 7, colbymchenry#19 streaming.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 4, 2026
…sites (colbymchenry#11)

Last Phase 1-6 backlog item. Replaces the "agent stitches
codegraph_callers + _history + _biomarkers manually for every stack
frame" loop with one tool: paste a stack trace, get a ranked list of
likely fix-site candidates with per-row "why suspected" reasons.

## Pipeline

1. `parseTrace(text)` — multi-format extractor: V8/Node, Python,
   Java/Kotlin/Scala, Go, Rust, plus a generic `path/to/file.ext:line`
   fallback. Caps at 200 frames; dedupes by `(file, line)`; normalises
   Windows backslashes; first-pattern-per-line wins.
2. `resolveTraceFile` — longest-suffix match between absolute trace
   paths and project-relative indexed paths, with a path-boundary
   check so `foo/bar.ts` doesn't match `sub-foo/bar.ts`.
3. `findEnclosingNode` (existing helper from index-hooks/enclosing.ts,
   shared with codegraph_grep) attributes each frame to its enclosing
   function/method/class via line number.
4. `scoreCulprits` composite: `1 / (topRank + 1)` for frame position
   (top of stack wins decisively), `+0.3` for risk-biomarker overlap
   (god_class / complex_method / large_method / nested_complexity),
   `+0.2` for recent file churn (touched within 30 days, with commit
   count surfaced in the reason).
5. Output: ranked candidates with per-row reasons + an "unmapped
   frames" footer (capped at 5 samples) so the agent sees what
   couldn't be attributed (`node_modules/`, generated files, etc.).
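Step 1's V8 pattern plus the generic fallback can be sketched as below (the regexes are illustrative assumptions; the real parser also covers Python, Java/Kotlin/Scala, Go, and Rust):

```typescript
interface Frame {
  file: string;
  line: number;
}

// First pattern per line wins; V8 tried before the generic fallback.
const V8_FRAME = /at\s+(?:.+?\s+\()?(.+?):(\d+):\d+\)?\s*$/;
const GENERIC = /([\w./\\-]+\.\w+):(\d+)/;

function parseTrace(text: string, cap = 200): Frame[] {
  const frames: Frame[] = [];
  const seen = new Set<string>();
  for (const raw of text.split("\n")) {
    if (frames.length >= cap) break; // 200-frame cap
    const m = V8_FRAME.exec(raw) ?? GENERIC.exec(raw);
    if (!m) continue;
    const file = m[1].replace(/\\/g, "/"); // normalise Windows backslashes
    const line = Number(m[2]);
    const key = `${file}:${line}`; // dedupe by (file, line)
    if (seen.has(key)) continue;
    seen.add(key);
    frames.push({ file, line });
  }
  return frames;
}
```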

## CLI mirror

`codegraph trace-to-culprits` reads from stdin (`cat error.log |
codegraph trace-to-culprits`) or from `--trace "..."`. Routes via
runViaMCP per the repo convention.

## Reviewer cycle

Round 1 returned BLOCK + REQUEST_CHANGES + info — all addressed:
  - **R1 (BLOCK, correctness)**: `recentCutoff` was computed in ms
    but `last_touched_ts` is unix seconds — the `>=` comparison
    always evaluated false, killing the churn signal silently on
    every project. Fixed with `Math.floor(Date.now() / 1000)` and a
    new R1-regression-guard test that stamps a fresh `last_touched_ts`
    via the same `applyChurnDeltas` helper the churn miner uses, then
    asserts the rendered output mentions "recent churn".
  - **R2 (REQUEST_CHANGES, correctness)**: inline `cg.queries.db
    .prepare(...)` violated the per-domain query-file convention and
    re-prepared the same statement per culprit. Switched to
    `getFileByPath` from queries-files.ts (cached statement, returns
    FileRecord with `lastTouchedTs` and `commitCount`).
  - **R3 (info, perf)**: `resolveTraceFile` ran `getAllFiles` once
    per frame — N full-table scans for an N-frame trace. Hoisted to
    once-per-call at the top of `buildCulprits`.

## Tests

13 trace tests in `__tests__/mcp-trace-to-culprits.test.ts`:
  - parseTrace: V8 / Python / Java / Go / dedup / 200-frame cap /
    no-frames / Windows backslash normalisation
  - End-to-end: V8 trace → enclosing symbols ranked top-of-stack
    first, unmapped-frames footer, no-refs error message, limit cap,
    R1-regression guard for the churn signal

CLI dogfood: `cat trace | npm run cli:dev -- trace-to-culprits`
parses a real production-shaped V8 trace and maps frames to
`src/mcp/tools/trace-to-culprits.ts` symbols.

Suite: **1595 / 34 / 0** (+13 trace tests).
Eval: 18/18 | recall=1.00 | mrr=0.91 | within budget (re-baselined
after colbymchenry#11 added new files to the corpus, which shifted the
explore-pipeline case rank — case still PASSes its 0.5 threshold,
the relative regression was corpus drift, not behaviour change).

## Backlog after this

Phase 1-6 complete. Remaining: Phase 7 (colbymchenry#17 propose_extract /
propose_rename, colbymchenry#18 plan-and-execute CLI — both need user check),
Phase 8 (colbymchenry#19 streaming MCP), and #6 TOON (deferred pending
measurement).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 4, 2026
Backlog #6 proposed adopting TOON (Token-Oriented Object Notation,
header-once / rows-as-tuples) as a smaller alternative to the
current markdown formatters for tabular MCP responses. The backlog
explicitly gated the decision: "Verify actual savings on captured
queries before flipping default — the 30-60% claim is for ideal
tabular data."

This commit ships the measurement and the answer.

## Result

Eight representative queries against this project's own .codegraph,
covering search (long signatures), suggest (short rows), callers
(20-row mixed-confidence list):

  sample                                  rows  md(B)  toon(B)  saving
  ──────────────────────────────────────  ────  ─────  ───────  ──────
  search "CodeGraph"                        10    683      661  +3.2%
  search "extractFromSource"                10   1483     1444  +2.6%
  search "compareToRef"                      3    461      468  -1.5%
  search "handleSearch"                      6    760      757  +0.4%
  search "parseTrace"                        2    291      298  -2.4%
  suggest "CodGrap"                         10    408      409  -0.2%
  suggest "extracFromSorce"                 10    386      387  -0.3%
  callers of extractFromSource              14    936     1037  -10.8%

  TOTAL: md=5408B  toon=5461B  aggregate saving -1.0%

## Why the 30-60% claim doesn't apply

TOON's win comes from compressing a verbose JSON baseline like
`[{"name":"foo","kind":"function","file":"a.ts","line":42},…]` —
the "headers repeated per row" + JSON quoting waste is what its
header-once shape removes.

But our markdown is ALREADY row-shaped:
  - `### foo (function)` ≈ `foo,function` (no quoting).
  - `a.ts:42` ≈ `a.ts,42` (single-char delimiter, both compact).
  - `- name (kind) - file:line` is shorter than the comma-tuple form
    when names are short.

There's no fat to trim. On callers (the densest row shape) TOON is
10% LARGER because the per-row `- ` bullet syntax is a one-byte
overhead while TOON's commas waste two chars between every field.

## Decision

Skip TOON. Empirical answer matches the backlog's gate — the
savings aren't there for our markdown baseline. Adopting it would:
  - Add per-tool format selector code + tests.
  - Risk LLM-client misrender (TOON is new; some clients haven't
    trained on it).
  - Net zero to net negative on payload bytes.

The measurement script (`__tests__/evaluation/toon-measure.ts`)
stays in the repo as a permanent record + a re-runnable harness if
the markdown formatters ever get verbose enough to flip the math.
Run with `npx tsx __tests__/evaluation/toon-measure.ts`.

## Backlog

#6 closed. Phase 1-6 of the agentic backlog is now fully resolved
(every item shipped or empirically dismissed). Remaining: Phase 7
(colbymchenry#17 propose_extract / propose_rename, colbymchenry#18 plan-and-execute CLI),
Phase 8 (colbymchenry#19 streaming MCP) — all need user check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 9, 2026
…search

Subtask #6 of the embedding-features arc Stage 3. New module
src/llm/query-expansion.ts that asks the configured chat backend
for paraphrases of a search query and caches the result by
sha256(n + ':' + text). Wired into searchHybridBlendWithEmbeddings:
when the top semantic-hit cosine score < 0.5 AND a chat backend is
reachable, the function expands the query into 3 paraphrases,
embeds each, runs them through llmSemanticTopK, and max-pools the
candidate scores across the original + expansions.

Anti-goals enforced:
- Returns [] on LLM error / abort / unreachable backend so callers
  fall back to the unexpanded path without paying a query stall.
- LRU-ish cache bounded at 256 entries — bumps recency on read,
  evicts oldest on write at capacity.
- Filters out paraphrases identical to the original (de-dup) and
  lines >= 500 chars (LLM hallucination guard).

Threshold + paraphrase count live as module-private constants
(QUERY_EXPANSION_SCORE_THRESHOLD = 0.5, _PARAPHRASE_COUNT = 3).
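The bounded LRU-ish cache relies on Map's insertion-order guarantee. A sketch (function names are assumptions; the real module is src/llm/query-expansion.ts):

```typescript
import { createHash } from "node:crypto";

const CACHE_CAPACITY = 256;
const cache = new Map<string, string[]>();

// Key by sha256(n + ':' + text), as described above.
function cacheKey(n: number, text: string): string {
  return createHash("sha256").update(`${n}:${text}`).digest("hex");
}

function cacheGet(key: string): string[] | undefined {
  const hit = cache.get(key);
  if (hit !== undefined) {
    cache.delete(key); // bump recency on read:
    cache.set(key, hit); // re-insert moves the key to newest position
  }
  return hit;
}

function cachePut(key: string, paraphrases: string[]): void {
  if (cache.size >= CACHE_CAPACITY && !cache.has(key)) {
    // Map iterates in insertion order, so the first key is the oldest.
    const oldest = cache.keys().next().value as string;
    cache.delete(oldest); // evict oldest on write at capacity
  }
  cache.set(key, paraphrases);
}
```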

Suite: 2136 -> 2163 (+27 tests covering input validation, LLM call
shape, output parsing, error handling, caching, duplicate filter).
Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 10, 2026
Adds `compact: true` to codegraph_callers + codegraph_callees inputSchema.
When set, the formatNodeList renderer emits one terse line per node:

  Callers of foo (3)
  bar|function|src/x.ts:42|id:n_a1b2c3d4
  baz|method|src/y.ts:120|id:n_e5f6g7h8

instead of the default markdown:

  ## Callers of foo (3 found)

  - bar (function) - src/x.ts:42 *(EXTRACTED)* (2 call sites: 12, 45) `[id: n_a1b2c3d4]`

Saves ~50-70% of output tokens on graph-walk chains. Default false
preserves the rich human-readable output for ad-hoc inspection.

Pattern is straightforward to extend to the other navigation tools
(search, refs, impact). Tracked as Stage 6 #6.1 follow-up.
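The two render modes boil down to one branch per row. A sketch (the `NodeRow` shape and helper name are assumptions; the call-site and confidence annotations from the rich mode are omitted):

```typescript
interface NodeRow {
  name: string;
  kind: string;
  path: string;
  line: number;
  id: string;
}

function renderRow(n: NodeRow, compact: boolean): string {
  return compact
    ? `${n.name}|${n.kind}|${n.path}:${n.line}|id:${n.id}`
    : `- ${n.name} (${n.kind}) - ${n.path}:${n.line} \`[id: ${n.id}]\``;
}
```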

Suite remains 2374/0/34. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 10, 2026
New MCP tool consolidating 5-10 chained codegraph_callers/callees
calls into one BFS traversal. Inputs:

- start: symbol name OR n_xxxxxxxx UID
- direction: 'callers' | 'callees' | 'impact' (bidirectional)
- hops: 1-5, default 2
- maxNodes: 1-200, default 50
- edgeKind?: filter to one edge kind
- compact?: default TRUE (this tool's purpose IS terse output)
- projectPath?: standard

BFS deduplicates via node_id Map, first-seen wins (shorter depth
preferred). Excluded edge kinds: similar_to / def_use / contains
(structural noise) on unfiltered walks. Output rows:
`name|kind|path:line|depth=N|via=parent_name|id:n_xxxxxxxx`.

26 tests in __tests__/mcp-walk.test.ts covering input validation,
BFS shape (hops/cycles/dedup/maxNodes cap), output format
(compact vs markdown), and registration.

Tool registered. Suite passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 10, 2026
…#6.2 + #6.3)

Closes the inline Stage 6 deferrals.

#6.2 since-tokens — codegraph_callers and codegraph_callees now
support `since: 'c_xxxxxxxx'` delta-mode chains. Same pattern as
codegraph_search / codegraph_grep / codegraph_explore (the
applyDeltaSince + mintCallId + appendCallIdFooter helpers from
shared.ts). Saves ~40-60% of output on multi-step walks where the
agent already saw most of the prior result.

#6.3 field projection — both tools accept a `fields: string[]` arg
that restricts compact-mode rows to a subset of {name, kind, path,
line, id}. Combined with `compact: true`, drops to the absolute
minimum the caller needs:
  fields: ["name", "path"] → `findSimilarViaVec|src/db/vec-helpers.ts`
Saves another ~50% on top of compact.

Both flags are opt-in; default behavior unchanged.

Suite remains 2433/0/34. Typecheck clean.

Stage 6 token-cost reductions now complete: #6.1 compact mode +
#6.2 since-tokens + #6.3 field projection + #6.4 codegraph_walk
all shipped. Cumulative effect on graph-heavy sessions: 60-80%
output-token reduction when all flags are set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 10, 2026
Reviewer pass on c895348..5730d98 caught 3 request_changes + 3 info;
all addressed. Also includes the deferred staleness gating (info #6)
and the playbook updates the user requested as a follow-up.

## Request_changes (real bugs)

1. **Chunk vec0 not cleared on full reset** (queries.ts + vec-helpers.ts):
   `clearAll` deletes nodes with FKs OFF, so the ON DELETE CASCADE on
   `symbol_chunk_embeddings` never fires. `clearVecTables` then only
   wiped `vec_symbol_embeddings_*` — leaving the chunk vec0 mirror
   tables full of ghost rowids that would poison the next index run's
   `findSimilarViaVec` chunk queries.
   Fixed: explicit `DELETE FROM symbol_chunk_embeddings` inside
   `clearAll`'s transaction; `clearVecTables` now also discovers and
   wipes `vec_chunk_embeddings_*` via the same LIKE pattern + regex
   whitelist.
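   The discover-and-wipe shape can be sketched like this — in the real code
   the candidates come from `sqlite_master` via the LIKE pattern; the regex
   and function name here are illustrative:

   ```typescript
   // Sketch of the LIKE-pattern + regex-whitelist guard: candidates would be
   // `SELECT name FROM sqlite_master WHERE name LIKE 'vec_%'`; the regex then
   // ensures only the expected embedding mirror tables get wiped.
   const VEC_TABLE_RE = /^vec_(symbol|chunk)_embeddings_[a-z0-9_]+$/;

   function selectVecTablesToWipe(candidates: string[]): string[] {
     return candidates.filter((name) => VEC_TABLE_RE.test(name));
   }
   ```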

2. **Docstring rot on upsertChunkEmbedding** (queries-chunk-embeddings.ts):
   JSDoc claimed "false when an existing row was updated" but
   `ON CONFLICT DO UPDATE` produces `changes=1` on both INSERT and
   UPDATE — the function returned true on both. Tightened JSDoc to
   match actual behavior; simplified the return condition to
   `result.changes > 0` (drops the meaningless lastInsertRowid check).
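   The semantics behind that fix, sketched with an illustrative stub (the
   `RunResult` shape mirrors better-sqlite3's, but this is not the repo's
   actual function):

   ```typescript
   // Why the JSDoc was wrong: `INSERT ... ON CONFLICT ... DO UPDATE` reports
   // changes=1 on BOTH the insert and the update branch, so the result cannot
   // distinguish the two cases.
   interface RunResult {
     changes: number;
     lastInsertRowid: number | bigint;
   }

   function upsertSucceeded(result: RunResult): boolean {
     // `changes > 0` is the honest condition; lastInsertRowid adds nothing
     // because it is stale on the UPDATE branch.
     return result.changes > 0;
   }
   ```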

3. **codegraph_walk missing CLI mirror** (bin/codegraph.ts): the new
   MCP tool had no `program.command('walk')` counterpart, violating
   the repo's CLI/MCP alignment convention. Added a thin CLI command
   that delegates via runViaMCP — same shape as the other delegating
   subcommands.

## Info-level

4. **Deferred staleness gating shipped** (multi-vec.ts): previously
   re-embedded every long symbol's chunks on every sync. Now stamps
   the file's content_hash on each chunk row (in the previously-
   unused summary_hash_at_embed column, repurposed for multi-vec) and
   discoverLongSymbols' SELECT skips symbols whose stored hash
   matches the current files.content_hash. Idempotent re-syncs with
   no source drift now do zero embed work.
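   A sketch of the gate (shapes hypothetical — the real check is a SQL
   predicate inside discoverLongSymbols' SELECT, not a TS function):

   ```typescript
   // Staleness gate: a symbol's chunks are re-embedded only when the file's
   // current content_hash has drifted from the hash stamped on the chunk row
   // at embed time (stored in the repurposed summary_hash_at_embed column).
   interface ChunkRow {
     symbolId: string;
     stampedHash: string | null; // null = never embedded under this scheme
   }

   function needsReembed(chunk: ChunkRow, currentFileHash: string): boolean {
     return chunk.stampedHash === null || chunk.stampedHash !== currentFileHash;
   }
   ```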

5. **parseFieldsArg duplication** (shared.ts + callers/callees):
   Extracted to `shared.ts` as `parseFieldsArg` + `CompactFieldName`
   type. Callers and callees both import the shared version.

6. **Playbook updates** (server-instructions.ts):
   - Added `codegraph_walk` use-case bullet under graph-walk tools
   - New "Token-cost flags" section documenting compact / fields /
     since on callers + callees
   - Updated `codegraph_risk_review` reference to the merged
     `codegraph_review({mode: 'risk'})` form

Skipped (genuine non-issues): the `since` delta filter on callers
not covering typeUsers (typeUsers is a small distinct section by
design, not part of the same-rank dedup); CRLF handling in
sliceNodeBody (would require explicit \r\n splitting; deferred
until Windows operators report regressions).

Suite remains 2433/0/34. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 11, 2026
…olved-by-FTS

Three more friction closures from the 2026-05-11 workout:

- src/extraction/index.ts + src/db/queries-files.ts: heal-marked
  files now propagate through the git fast-path. The
  EXTRACTION_LOGIC_VERSION self-heal sets needs_reextract=1 on every
  file, but the git fast-path eoCollectGitChanges only iterated
  files git reported as changed — so heal-marked files with no
  diff were silently skipped in any git-tracked project. Adds a
  getFilesNeedingReextract(qb) query and unions the result into
  the change set after the git fast-path returns. Full-scan path
  was already correct via eoClassifyFileChange; no change there.
  Closes friction colbymchenry#22 (surfaced while validating the cluster fix
  in 6957b74 — the original needs_reextract=1 → 620 of 627 stayed
  pending after sync, requiring admin index --force).
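  The fix's shape, sketched under assumptions (function and parameter names
  here are hypothetical; the real union happens after the git fast-path
  returns, fed by the new getFilesNeedingReextract query):

  ```typescript
  // Union the git fast-path's change set with the heal-flagged files, so a
  // needs_reextract=1 row survives even when git reports no diff for it.
  function unionChangeSet(
    gitChanged: string[],
    needingReextract: string[],
  ): string[] {
    const set = new Set(gitChanged);
    for (const f of needingReextract) set.add(f);
    return [...set];
  }
  ```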

- src/mcp/tools/entry-points.ts: new "MCP tools" bucket scans for
  exported `*_TOOL` constants whose initializer signature contains
  `name: 'codegraph_*'` and surfaces them with the parsed canonical
  tool name. Closes friction colbymchenry#13 / cluster symptom #6: every MCP
  tool dispatcher previously fell through every bucket because
  routes/CLI commands/CLI-files/public-exports each had a different
  filter mismatch. Five buckets now (HTTP routes, CLI commands,
  MCP tools, CLI files, public exports). __tests__/mcp-entry-points
  grows 5 → 6 tests with the new bucket fixture.
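  The new bucket's scan, sketched as a regex over source text (the real
  implementation presumably walks extracted nodes; the regex and function
  name are illustrative):

  ```typescript
  // Find exported `*_TOOL` constants whose initializer contains
  // `name: 'codegraph_*'`, and surface the parsed canonical tool name.
  const TOOL_RE =
    /export\s+const\s+(\w+_TOOL)\b[^;]*?name:\s*'(codegraph_\w+)'/g;

  function scanMcpTools(
    source: string,
  ): Array<{ constant: string; tool: string }> {
    return [...source.matchAll(TOOL_RE)].map((m) => ({
      constant: m[1],
      tool: m[2],
    }));
  }
  ```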

- Friction colbymchenry#9 (semantic/intent search can't bridge tool-name →
  handler-name): resolved by existing FTS5 path. Investigation
  showed nodes_fts indexes the `signature` column at bm25=2, so
  searching `codegraph_X` already returns the matching XXX_TOOL
  constant via signature substring match. The morning-workout
  "No results" observation was likely a stale-index transient
  (pre-cluster-fix). Added a regression test in mcp-search-family
  to lock in the behavior.

Test impact: 2633 → 2635 (one new MCP-tools-bucket test, one new
canonical-tool-name regression test). Type-check clean.

Closes:
- colbymchenry#22 EXTRACTION_LOGIC_VERSION heal flag honored
- colbymchenry#13 entry_points 5th bucket for MCP tool dispatchers
- colbymchenry#9 already-resolved-by-FTS — locked in with regression test

Still open:
- colbymchenry#8 / cluster #1 (context whiff): graph edge now exists, but the
  context-builder ranker still surfaces ToolHandler/ToolResult/
  ToolCtx instead of dispatchers. Needs a context-ranker boost
  for `codegraph_*` queries. Logged as colbymchenry#24 follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>